Quality-Based Similarity Search for Biological Sequence Databases

Authors

  • Xuehui Li
  • Tamer Kahveci
Abstract

Low-complexity regions (LCRs) of biological sequences are the main source of false positives in similarity searches over biological sequence databases. We consider the problem of finding similar sequences when the locations of the LCRs are not known precisely. We develop a formulation that measures the quality of each letter in a sequence: the quality value of a letter is the probability that the letter lies in a non-LCR. We show that quality values can be incorporated into two fundamental approaches to the sequence search problem to significantly reduce the number of false positives they produce. The first approach finds the optimal alignment of two sequences using dynamic programming. The second computes a suboptimal alignment using a hash table. For the second approach, we also develop a randomized, memory-resident hash table that indexes k-grams (sequences of length k) probabilistically. The k-grams that are likely to contain LCRs are indexed with lower probabilities. As a result, memory usage and CPU cost are greatly reduced. We also show that this hash table can be used to reconstruct query sequences with negligible information loss, which eliminates the need to store these sequences. Our experiments on real data show that our quality-based similarity search algorithms drastically reduce the number of false positives. In addition, their running times were better than those of existing strategies.
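The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' formulation: the abstract does not give the exact formulas, so the sketch assumes (hypothetically) that a k-gram is indexed with probability equal to the product of its letters' quality values, and that each substitution score in a Smith-Waterman-style dynamic program is scaled by the product of the two letters' quality values. All function names and scoring parameters are illustrative.

```python
import random
from collections import defaultdict


def kgram_index_probability(qualities):
    # ASSUMPTION: probability of indexing a k-gram is the product of its
    # per-letter quality values (probability of being in a non-LCR).
    p = 1.0
    for q in qualities:
        p *= q
    return p


def build_probabilistic_index(seq, quality, k, rng):
    """Index each k-gram of seq with a probability derived from its quality.

    Returns a dict mapping k-gram -> list of start positions. k-grams that
    are likely to overlap an LCR (low quality) are usually skipped.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        p = kgram_index_probability(quality[i:i + k])
        if rng.random() < p:  # random() is in [0, 1), so p == 1 always indexes
            index[seq[i:i + k]].append(i)
    return index


def quality_weighted_local_score(a, qa, b, qb, match=2.0, mismatch=-1.0, gap=-2.0):
    """Best local alignment score with quality-scaled substitution scores.

    ASSUMPTION: the score of aligning a[i] with b[j] is multiplied by
    qa[i] * qb[j], so letters likely to be in LCRs contribute little.
    """
    prev = [0.0] * (len(b) + 1)
    best = 0.0
    for i in range(1, len(a) + 1):
        cur = [0.0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            w = qa[i - 1] * qb[j - 1]
            cur[j] = max(0.0, prev[j - 1] + w * s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    seq = "ACGTAAAA"
    quality = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # trailing poly-A treated as LCR
    index = build_probabilistic_index(seq, quality, k=3, rng=random.Random(0))
    print(dict(index))
    print(quality_weighted_local_score("ACGT", [1.0] * 4, "ACGT", [1.0] * 4))
```

Under these assumptions, a k-gram that touches a letter with quality 0 is never indexed, and two identical low-complexity sequences with quality 0 score nothing, which is the intended effect of suppressing LCR-driven false positives.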


Similar resources

Similarity Search Using Pre-Search in UniRef100 Database

Sequence similarity in biological databases is used to characterize a newly discovered protein and to confirm the existence of its homologs. This is often computationally very expensive. We have implemented a new algorithm that performs sequence similarity search using a pre-search phase. The proposed algorithm works in three phases. As preparation for Pre-Search, we locate a sequence, sim...


KEGG and DBGET/LinkDB: Integration of Biological Relationships in Divergent Molecular Biology Data

A simple formulation to integrate various biological data is presented based on the concept of links, which are classified into three types: factual, similarity, and biological. Factual links are cross-reference information of entries among molecular biology databases. Similarity links are neighbor information of sequence entries computed by sequence similarity search programs. Biological links ...


BLAST2SRS, a web server for flexible retrieval of related protein sequences in the SWISS-PROT and SPTrEMBL databases

SRS (Sequence Retrieval System) is a widely used keyword search engine for querying biological databases. BLAST2 is the most widely used tool to query databases by sequence similarity search. These tools allow users to retrieve sequences by shared keyword or by shared similarity, with many public web servers available. However, with the increasingly large datasets available it is now quite comm...


The Annotation-enriched non-redundant patent sequence databases

The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Anno...


Similarity Search in Moving Object Trajectories

The continuous and rapid advance in mobile and communications technology opens the way for new research areas and new applications. Moving Object Databases (MODs) are among the emerging research topics that are attracting much work due to their vital need in many applications. Generally, MODs deal with geometries changing over time. In this paper we study an interesting point in moving object dat...



Publication date: 2007